1 | Importing Libraries and Loading the Dataset¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import xgboost as xgb
colors = ['#ffcd94', '#eac086', '#ffad60', '#ffe39f', '#ffd700', '#ff8c00', '#ff6347', '#deb887', '#f4a460', '#cd853f']
sns.set_palette(sns.color_palette(colors))
df = pd.read_csv(r"C:\Users\HP\Downloads\bodyfat .csv")
df.head()
| Density | BodyFat | Age | Weight | Height | Neck | Chest | Abdomen | Hip | Thigh | Knee | Ankle | Biceps | Forearm | Wrist | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0708 | 12.3 | 23 | 154.25 | 67.75 | 36.2 | 93.1 | 85.2 | 94.5 | 59.0 | 37.3 | 21.9 | 32.0 | 27.4 | 17.1 |
| 1 | 1.0853 | 6.1 | 22 | 173.25 | 72.25 | 38.5 | 93.6 | 83.0 | 98.7 | 58.7 | 37.3 | 23.4 | 30.5 | 28.9 | 18.2 |
| 2 | 1.0414 | 25.3 | 22 | 154.00 | 66.25 | 34.0 | 95.8 | 87.9 | 99.2 | 59.6 | 38.9 | 24.0 | 28.8 | 25.2 | 16.6 |
| 3 | 1.0751 | 10.4 | 26 | 184.75 | 72.25 | 37.4 | 101.8 | 86.4 | 101.2 | 60.1 | 37.3 | 22.8 | 32.4 | 29.4 | 18.2 |
| 4 | 1.0340 | 28.7 | 24 | 184.25 | 71.25 | 34.4 | 97.3 | 100.0 | 101.9 | 63.2 | 42.2 | 24.0 | 32.2 | 27.7 | 17.7 |
👉 | About the dataset
Dataset Overview: Body Fat Measurements¶
Context:¶
This dataset provides estimates of body fat percentage determined through underwater weighing, alongside various body circumference measurements for 252 men. The goal is to develop predictive models for estimating body fat based on simpler and less invasive measurements.
Educational Use:¶
This dataset is ideal for demonstrating multiple regression techniques. Since accurately measuring body fat through underwater weighing is both inconvenient and costly, this dataset helps illustrate how to estimate body fat using more accessible body circumference measurements.
Measurement Standards: Measurements follow the standards outlined in Behnke and Wilmore (1974), pages 45-48. For instance, the abdomen 2 circumference is measured laterally at the level of the iliac crests and anteriorly at the umbilicus.
Application:¶
These data are used to produce predictive equations for lean body weight as discussed in the abstract "Generalized Body Composition Prediction Equation for Men Using Simple Measurement Techniques" by Penrose, Nelson, and Fisher, published in Medicine and Science in Sports and Exercise, vol. 17, no. 2, April 1985, p. 189. The predictive equations were developed from the first 143 of the 252 cases provided in this dataset.
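Since the body fat values here were determined from underwater-weighing density, the BodyFat column closely follows Siri's equation, BF% = 495/Density − 450. A quick sanity check against the first row shown above (Density 1.0708, BodyFat 12.3):

```python
# Siri (1956): body-fat percentage from body density
def siri_bodyfat(density: float) -> float:
    return 495.0 / density - 450.0

# First row of the dataset: Density = 1.0708, BodyFat = 12.3
print(round(siri_bodyfat(1.0708), 1))  # 12.3
```

This near-deterministic link is worth keeping in mind: it explains why Density later shows a -0.945 correlation with the transformed target and dominates the feature importances.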
2 | Understanding Our Data¶
👉 | Shape
#What is the shape of the dataset?
df.shape
(252, 15)
👉 | Information
# Extract information about DataFrame
df_info = pd.DataFrame({
'Non-Null Count': df.notnull().sum(),
'Data Type': df.dtypes
})
# Apply stylish formatting with custom colors
styled_df_info = (
df_info.style
.set_properties(**{
'background-color': 'black', # Background color for the entire table
'color': '#eac086', # Text color
'border': '1px solid black', # Border color
'padding': '8px' # Padding for cells
})
.set_caption('DataFrame Information: Attributes and Data Types') # Add a title to the table
.set_table_styles([
{'selector': 'th', 'props': [('background-color', '#eac086')]}, # Heading background color
])
)
# Display the styled DataFrame
styled_df_info
| Non-Null Count | Data Type | |
|---|---|---|
| Density | 252 | float64 |
| BodyFat | 252 | float64 |
| Age | 252 | int64 |
| Weight | 252 | float64 |
| Height | 252 | float64 |
| Neck | 252 | float64 |
| Chest | 252 | float64 |
| Abdomen | 252 | float64 |
| Hip | 252 | float64 |
| Thigh | 252 | float64 |
| Knee | 252 | float64 |
| Ankle | 252 | float64 |
| Biceps | 252 | float64 |
| Forearm | 252 | float64 |
| Wrist | 252 | float64 |
#Some analysis on the numerical columns
df.describe()
| Density | BodyFat | Age | Weight | Height | Neck | Chest | Abdomen | Hip | Thigh | Knee | Ankle | Biceps | Forearm | Wrist | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 252.000000 | 252.000000 | 252.000000 | 252.000000 | 252.000000 | 252.000000 | 252.000000 | 252.000000 | 252.000000 | 252.000000 | 252.000000 | 252.000000 | 252.000000 | 252.000000 | 252.000000 |
| mean | 1.055574 | 19.150794 | 44.884921 | 178.924405 | 70.148810 | 37.992063 | 100.824206 | 92.555952 | 99.904762 | 59.405952 | 38.590476 | 23.102381 | 32.273413 | 28.663889 | 18.229762 |
| std | 0.019031 | 8.368740 | 12.602040 | 29.389160 | 3.662856 | 2.430913 | 8.430476 | 10.783077 | 7.164058 | 5.249952 | 2.411805 | 1.694893 | 3.021274 | 2.020691 | 0.933585 |
| min | 0.995000 | 0.000000 | 22.000000 | 118.500000 | 29.500000 | 31.100000 | 79.300000 | 69.400000 | 85.000000 | 47.200000 | 33.000000 | 19.100000 | 24.800000 | 21.000000 | 15.800000 |
| 25% | 1.041400 | 12.475000 | 35.750000 | 159.000000 | 68.250000 | 36.400000 | 94.350000 | 84.575000 | 95.500000 | 56.000000 | 36.975000 | 22.000000 | 30.200000 | 27.300000 | 17.600000 |
| 50% | 1.054900 | 19.200000 | 43.000000 | 176.500000 | 70.000000 | 38.000000 | 99.650000 | 90.950000 | 99.300000 | 59.000000 | 38.500000 | 22.800000 | 32.050000 | 28.700000 | 18.300000 |
| 75% | 1.070400 | 25.300000 | 54.000000 | 197.000000 | 72.250000 | 39.425000 | 105.375000 | 99.325000 | 103.525000 | 62.350000 | 39.925000 | 24.000000 | 34.325000 | 30.000000 | 18.800000 |
| max | 1.108900 | 47.500000 | 81.000000 | 363.150000 | 77.750000 | 51.200000 | 136.200000 | 148.100000 | 147.700000 | 87.300000 | 49.100000 | 33.900000 | 45.000000 | 34.900000 | 21.400000 |
👉 | Null Values Handling
# Calculate the number of null values and their percentages
null_counts = df.isnull().sum()
total_rows = len(df)
null_percentages = (null_counts / total_rows) * 100
# Create a DataFrame to display the counts and percentages
null_summary = pd.DataFrame({
'Null Values': null_counts,
'Percentage': null_percentages
})
# Apply stylish formatting with custom colors
styled_null_summary = (
null_summary.style
.format({'Percentage': '{:.2f}%'}) # Format percentage to two decimal places
.background_gradient(cmap='coolwarm', subset=['Percentage']) # Apply a gradient to the 'Percentage' column
.highlight_max(subset=['Null Values'], color='lightcoral') # Highlight the row with the maximum null values
.set_caption('Summary of Null Values and Their Percentages') # Add a title to the table
.set_table_styles([
{'selector': 'thead th', 'props': [('background-color', '#eac086'), # Header background color
('color', 'black'), # Header text color
('font-weight', 'bold')]},
{'selector': 'tbody tr:hover', 'props': [('background-color', '#eac086')]}, # Hover effect with background color
{'selector': 'tbody td', 'props': [('background-color', 'black'), # Table body background color
('color', '#eac086'), # Table body text color
('border', '1px solid #eac086'), # Border color
('padding', '8px')]}
])
)
# Display the styled DataFrame
styled_null_summary
| Null Values | Percentage | |
|---|---|---|
| Density | 0 | 0.00% |
| BodyFat | 0 | 0.00% |
| Age | 0 | 0.00% |
| Weight | 0 | 0.00% |
| Height | 0 | 0.00% |
| Neck | 0 | 0.00% |
| Chest | 0 | 0.00% |
| Abdomen | 0 | 0.00% |
| Hip | 0 | 0.00% |
| Thigh | 0 | 0.00% |
| Knee | 0 | 0.00% |
| Ankle | 0 | 0.00% |
| Biceps | 0 | 0.00% |
| Forearm | 0 | 0.00% |
| Wrist | 0 | 0.00% |
Great, we have no null values in the dataset!
Great, we have no duplicate values in the dataset!
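The duplicate claim can be verified with `df.duplicated()`. A minimal, self-contained sketch (using a hypothetical `df_demo` rather than the body-fat data; for the body-fat data, `df.duplicated().sum()` should return 0):

```python
import pandas as pd

# df.duplicated() flags rows that exactly repeat an earlier row
df_demo = pd.DataFrame({'a': [1, 2, 1], 'b': [3.0, 4.0, 3.0]})
print(df_demo.duplicated().sum())  # 1: the third row repeats the first
```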
3 | Exploratory Data Analysis¶
👉 | Plotting The Features
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import scipy.stats as stats
warnings.filterwarnings('ignore')
# Customize the color palette
colors = ['#eac086', '#ffcd94', '#e57373']
# Create subplots
fig, ax = plt.subplots(15, 3, figsize=(30, 90))
for index, column in enumerate(df.select_dtypes(include='number').columns):
# Distribution Plot with KDE
sns.histplot(df[column], kde=True, color=colors[0], alpha=0.9, ax=ax[index, 0], bins=30, edgecolor='black')
ax[index, 0].set_title(f'Distribution Plot of {column}', fontsize=14, weight='bold')
ax[index, 0].set_xlabel(column, fontsize=12)
ax[index, 0].set_ylabel('Frequency', fontsize=12)
ax[index, 0].grid(True)
# Boxplot
sns.boxplot(x=df[column], ax=ax[index, 1], color=colors[1], saturation=0.9)
ax[index, 1].set_title(f'Box Plot of {column}', fontsize=14, weight='bold')
ax[index, 1].set_xlabel(column, fontsize=12)
ax[index, 1].grid(True)
# Q-Q Plot for Normality Check
stats.probplot(df[column].dropna(), plot=ax[index, 2])
ax[index, 2].get_lines()[1].set_color(colors[2]) # Line of the Q-Q plot
ax[index, 2].get_lines()[1].set_linewidth(2)
ax[index, 2].set_title(f'Q-Q Plot of {column}', fontsize=14, weight='bold')
ax[index, 2].grid(True)
# Improve overall layout and add a main title
fig.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.subplots_adjust(top=0.95, hspace=0.4)
plt.suptitle("Visualizing Continuous Columns", fontsize=50, weight='bold', color='black')
plt.show()
Observations¶
- The dataset has some outliers.
- Some columns, such as Height, Ankle, and Age, are skewed
We see that Ankle, Hip, Weight, and Height are the most skewed columns.
👉 | Handling Skewness
from scipy.stats import boxcox, skew
# Step 1: Calculate Skewness Before Transformation
skewness_before = df.skew(axis=0).sort_values()
skewness_df_before = pd.DataFrame(skewness_before, columns=['Skewness Before'])
# Step 2: Apply Transformations and Store the Transformed Features
transformed_features = {}
skewness_after = []
for col in df.select_dtypes(include='number').columns:
# Check if column values are all positive (required for Box-Cox transformation)
if (df[col] > 0).all():
transformed_data, fitted_lambda = boxcox(df[col].dropna())
transformed_features[col] = transformed_data # Store transformed data
skewness_after.append({'Column': col, 'Skewness After': skew(transformed_data)})
else:
# For columns with non-positive values, shift so the minimum is 1, then apply log1p
transformed_data = np.log1p(df[col] - df[col].min() + 1)
transformed_features[col] = transformed_data
skewness_after.append({'Column': col, 'Skewness After': skew(transformed_data)})
# Convert skewness after transformation to DataFrame
skewness_df_after = pd.DataFrame(skewness_after).set_index('Column')
# Step 3: Use Pandas Styling to Display Skewness Tables Before and After Transformation
styled_skewness_before = (
skewness_df_before.style
.background_gradient(cmap='coolwarm')
.set_properties(**{
'background-color': '#eac086', # Set custom background color
'color': 'black', # Set text color to black
'border': '1px solid black', # Border color
'padding': '8px' # Padding for better readability
})
.set_caption('------- Column Skewness Before Transformation ------')
)
styled_skewness_after = (
skewness_df_after.style
.background_gradient(cmap='coolwarm')
.set_properties(**{
'background-color': '#eac086', # Set custom background color
'color': 'black', # Set text color to black
'border': '1px solid black', # Border color
'padding': '8px' # Padding for better readability
})
.set_caption('---- Column Skewness After Transformation ---')
)
# Display the styled DataFrames
display(styled_skewness_before)
| Skewness Before | |
|---|---|
| Height | -5.384987 |
| Forearm | -0.219333 |
| Density | -0.020176 |
| BodyFat | 0.146353 |
| Wrist | 0.281614 |
| Age | 0.283521 |
| Biceps | 0.285530 |
| Knee | 0.516744 |
| Neck | 0.552620 |
| Chest | 0.681556 |
| Thigh | 0.821210 |
| Abdomen | 0.838418 |
| Weight | 1.205263 |
| Hip | 1.497127 |
| Ankle | 2.255134 |
display(styled_skewness_after)
| Skewness After | |
|---|---|
| Column | |
| Density | -0.002732 |
| BodyFat | -1.184503 |
| Age | -0.028675 |
| Weight | -0.012174 |
| Height | 0.160706 |
| Neck | -0.016034 |
| Chest | -0.005164 |
| Abdomen | -0.004243 |
| Hip | -0.045212 |
| Thigh | -0.015753 |
| Knee | -0.005937 |
| Ankle | -0.113418 |
| Biceps | 0.000303 |
| Forearm | 0.028126 |
| Wrist | -0.001051 |
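One caveat: the target BodyFat is itself Box-Cox transformed, so any model trained on it predicts in the transformed space. To report predictions in percent body fat, the fitted lambda must be stored and inverted. A minimal sketch, assuming the loop above is extended to keep `fitted_lambda` for the BodyFat column:

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

# Sample BodyFat-like values (must be strictly positive for Box-Cox)
data = np.array([6.1, 10.4, 12.3, 25.3, 28.7])
transformed, fitted_lambda = boxcox(data)

# inv_boxcox undoes the transform exactly, given the same lambda
recovered = inv_boxcox(transformed, fitted_lambda)
print(np.allclose(recovered, data))  # True
```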
👉 | New, Improved DataFrame
# Step 4: Using Transformed Features for Further Analysis
# Create a new DataFrame with transformed features
df2 = pd.DataFrame(transformed_features)
df2.columns = [f"{col}_transformed" for col in df2.columns]
df2.head()
| Density_transformed | BodyFat_transformed | Age_transformed | Weight_transformed | Height_transformed | Neck_transformed | Chest_transformed | Abdomen_transformed | Hip_transformed | Thigh_transformed | Knee_transformed | Ankle_transformed | Biceps_transformed | Forearm_transformed | Wrist_transformed | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.071748 | 2.660260 | 7.566358 | 1.748828 | 5.309691e+09 | 1.334254 | 0.671926 | 1.056564 | 0.384054 | 0.843417 | 0.749983 | 0.296298 | 3.995944 | 245.395629 | 1.232805 |
| 1 | 0.086672 | 2.091864 | 7.356642 | 1.756517 | 7.673162e+09 | 1.339406 | 0.671932 | 1.056142 | 0.384054 | 0.843375 | 0.749983 | 0.296299 | 3.932654 | 270.859227 | 1.241166 |
| 2 | 0.041726 | 3.306887 | 7.356642 | 1.748717 | 4.670870e+09 | 1.328781 | 0.671959 | 1.057053 | 0.384054 | 0.843500 | 0.750324 | 0.296300 | 3.857367 | 210.142920 | 1.228695 |
| 3 | 0.076166 | 2.517696 | 8.169442 | 1.760570 | 7.673162e+09 | 1.337009 | 0.672025 | 1.056785 | 0.384054 | 0.843568 | 0.749983 | 0.296299 | 4.012360 | 279.602364 | 1.241166 |
| 4 | 0.034220 | 3.424263 | 7.771548 | 1.760402 | 7.084643e+09 | 1.329820 | 0.671976 | 1.058933 | 0.384054 | 0.843964 | 0.750934 | 0.296300 | 4.004175 | 250.396172 | 1.237475 |
# Extract information about DataFrame
df_info = pd.DataFrame({
'Non-Null Count': df2.notnull().sum(),
'Data Type': df2.dtypes
})
# Apply stylish formatting with custom colors
styled_df_info = (
df_info.style
.set_properties(**{
'background-color': 'black', # Background color for the entire table
'color': '#eac086', # Text color
'border': '1px solid black', # Border color
'padding': '8px' # Padding for cells
})
.set_caption('DataFrame Information: Attributes and Data Types') # Add a title to the table
.set_table_styles([
{'selector': 'th', 'props': [('background-color', '#eac086')]}, # Heading background color
])
)
# Display the styled DataFrame
styled_df_info
| Non-Null Count | Data Type | |
|---|---|---|
| Density_transformed | 252 | float64 |
| BodyFat_transformed | 252 | float64 |
| Age_transformed | 252 | float64 |
| Weight_transformed | 252 | float64 |
| Height_transformed | 252 | float64 |
| Neck_transformed | 252 | float64 |
| Chest_transformed | 252 | float64 |
| Abdomen_transformed | 252 | float64 |
| Hip_transformed | 252 | float64 |
| Thigh_transformed | 252 | float64 |
| Knee_transformed | 252 | float64 |
| Ankle_transformed | 252 | float64 |
| Biceps_transformed | 252 | float64 |
| Forearm_transformed | 252 | float64 |
| Wrist_transformed | 252 | float64 |
👉 | Correlation Matrix
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
from sklearn.decomposition import PCA
from scipy.stats import pearsonr, skew
from scipy.special import boxcox1p
# Step 1: Exploratory Data Analysis (EDA)
plt.figure(figsize=(14, 10))
sns.heatmap(df2.corr(), annot=True, cmap=sns.diverging_palette(230, 20, as_cmap=True), center=0, linewidths=0.5, square=True)
plt.title('Correlation Matrix of Features')
plt.show()
👉 | PairPlots
# Set the plotting style and palette
sns.set_palette(sns.color_palette(["#000000", "#eac086"]))
# Create the pair plot (sns.pairplot creates its own figure; panel size is set via `height`)
sns.pairplot(df2[['Density_transformed', 'BodyFat_transformed', 'Age_transformed',
'Weight_transformed', 'Height_transformed', 'Neck_transformed',
'Chest_transformed', 'Abdomen_transformed', 'Hip_transformed',
'Thigh_transformed', 'Knee_transformed', 'Ankle_transformed',
'Biceps_transformed', 'Forearm_transformed', 'Wrist_transformed']],
diag_kind='kde', markers='o', height=3, aspect=1, kind='scatter')
# Set the title and show the plot
plt.suptitle('Pair Plot of Features')
plt.show()
correlations = df2.corr()['BodyFat_transformed'].sort_values(ascending=False)
# Filter features highly correlated with BodyFat
significant_features = correlations[abs(correlations) > 0.3].index.tolist()
# Create a DataFrame with the correlation values
corr_df = pd.DataFrame(correlations).reset_index()
corr_df.columns = ['Feature', 'Correlation']
# Function to apply color formatting based on correlation value
def color_corr(val):
color = '#4CAF50' if val > 0.3 else ('#F44336' if val < -0.3 else '#ffffff') # Green for positive, Red for negative, White for neutral
return f'background-color: {color}; color: black'
# Apply the styling
styled_corr_df = corr_df.style.map(color_corr, subset=['Correlation']).set_table_styles(
[{'selector': 'thead',
'props': [('background-color', '#eac086'), ('color', 'black'), ('font-weight', 'bold')]}]
).set_properties(**{'background-color': 'black', 'color': 'white'})
# Display the styled DataFrame
styled_corr_df
| Feature | Correlation | |
|---|---|---|
| 0 | BodyFat_transformed | 1.000000 |
| 1 | Abdomen_transformed | 0.790485 |
| 2 | Chest_transformed | 0.678641 |
| 3 | Hip_transformed | 0.624756 |
| 4 | Weight_transformed | 0.618177 |
| 5 | Thigh_transformed | 0.576846 |
| 6 | Knee_transformed | 0.512810 |
| 7 | Biceps_transformed | 0.491864 |
| 8 | Neck_transformed | 0.464909 |
| 9 | Forearm_transformed | 0.357903 |
| 10 | Wrist_transformed | 0.345377 |
| 11 | Ankle_transformed | 0.295833 |
| 12 | Age_transformed | 0.281622 |
| 13 | Height_transformed | -0.000258 |
| 14 | Density_transformed | -0.945249 |
👉 | Separating Independent and Dependent Variables
X = df2.drop(columns=['BodyFat_transformed'])
y = df2['BodyFat_transformed']
4 | Feature Engineering and Preprocessing¶
👉 | Adding Some Features
# Step 2: Feature Engineering
X['BMI'] = X['Weight_transformed'] / (X['Height_transformed'] / 100) ** 2 # Body Mass Index
X['WaistToHipRatio'] = X['Abdomen_transformed'] / X['Hip_transformed'] # Waist-to-Hip Ratio
X['BodySurfaceArea'] = 0.007184 * (X['Height_transformed'] ** 0.725) * (X['Weight_transformed'] ** 0.425) # Body Surface Area
X['AgeSquared'] = X['Age_transformed'] ** 2 # Age squared to capture non-linear effects
X['AbdomenToChestRatio'] = X['Abdomen_transformed'] / X['Chest_transformed'] # Abdomen-to-Chest Ratio
# Step 7: Domain-Specific Insights and Custom Feature Extraction
X['UpperBodyFat'] = X['Neck_transformed'] + X['Chest_transformed'] + X['Biceps_transformed']
X['LowerBodyFat'] = X['Thigh_transformed'] + X['Knee_transformed'] + X['Ankle_transformed']
X['ArmFatIndex'] = (X['Biceps_transformed'] + X['Forearm_transformed']) / X['Wrist_transformed']
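For reference, the 0.007184 coefficient comes from the Du Bois body-surface-area formula, which expects height in cm and weight in kg; applied to the transformed columns above, the feature acts as a monotone proxy rather than a literal BSA. A worked example on raw units (the 178 cm / 75 kg figures are illustrative):

```python
# Du Bois & Du Bois (1916): BSA (m^2) = 0.007184 * height_cm^0.725 * weight_kg^0.425
def du_bois_bsa(height_cm: float, weight_kg: float) -> float:
    return 0.007184 * height_cm ** 0.725 * weight_kg ** 0.425

# A 178 cm, 75 kg adult comes out near the typical ~1.9 m^2
print(round(du_bois_bsa(178, 75), 2))
```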
👉 | Dropping Features
X.drop(['Weight_transformed', 'Neck_transformed', 'Biceps_transformed', 'Knee_transformed', 'Ankle_transformed', 'Forearm_transformed', 'Wrist_transformed' ,'Height_transformed','Abdomen_transformed','Chest_transformed','Hip_transformed','Thigh_transformed'],axis=1,inplace=True)
X.columns
Index(['Density_transformed', 'Age_transformed', 'BMI', 'WaistToHipRatio',
'BodySurfaceArea', 'AgeSquared', 'AbdomenToChestRatio', 'UpperBodyFat',
'LowerBodyFat', 'ArmFatIndex'],
dtype='object')
X.head(2)
| Density_transformed | Age_transformed | BMI | WaistToHipRatio | BodySurfaceArea | AgeSquared | AbdomenToChestRatio | UpperBodyFat | LowerBodyFat | ArmFatIndex | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.071748 | 7.566358 | 6.203097e-16 | 2.751084 | 102378.568243 | 57.249780 | 1.572441 | 6.002123 | 1.889697 | 202.296087 |
| 1 | 0.086672 | 7.356642 | 2.983345e-16 | 2.749984 | 133952.229459 | 54.120175 | 1.571798 | 5.943992 | 1.889657 | 221.398142 |
👉 | Feature Extraction By Recursive Feature Elimination (RFE)
# Provided feature importances
rfe_features = [
'Density_transformed',
'Age_transformed',
'WaistToHipRatio',
'BodySurfaceArea',
'AgeSquared',
'AbdomenToChestRatio',
'UpperBodyFat',
'LowerBodyFat',
'ArmFatIndex'
]
rfe_importances = [
0.9481,
0.0004,
0.0098,
0.0081,
0.0014,
0.0113,
0.0056,
0.0111,
0.0041
]
# Create a DataFrame
df_rfe = pd.DataFrame({
'Selected Feature': rfe_features,
'Importance': rfe_importances
})
# Define styling function
def style_df(df):
return df.style.set_table_styles(
[{'selector': 'thead th',
'props': [('background-color', '#eac086'),
('color', 'black'),
('font-weight', 'bold'),
('text-align', 'center'),
('font-size', '14px')]},
{'selector': 'td',
'props': [('padding', '10px'),
('background-color', '#000000'),
('color', '#eac086'),
('text-align', 'center'),
('font-size', '12px')]},
{'selector': 'table',
'props': [('border-collapse', 'collapse'),
('width', '60%'),
('margin', '20px auto'),
('border', '2px solid #000000')]},
{'selector': 'tr:nth-of-type(even)',
'props': [('background-color', '#f9f9f9')]},
{'selector': 'tr:nth-of-type(odd)',
'props': [('background-color', '#ffffff')]}]
).set_properties(**{'text-align': 'center'}).hide(axis='index')
# Apply styling to the DataFrame
styled_df_rfe = style_df(df_rfe)
# Display the styled DataFrame
styled_df_rfe
| Selected Feature | Importance |
|---|---|
| Density_transformed | 0.948100 |
| Age_transformed | 0.000400 |
| WaistToHipRatio | 0.009800 |
| BodySurfaceArea | 0.008100 |
| AgeSquared | 0.001400 |
| AbdomenToChestRatio | 0.011300 |
| UpperBodyFat | 0.005600 |
| LowerBodyFat | 0.011100 |
| ArmFatIndex | 0.004100 |
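The importances above are quoted as given. For completeness, here is a hedged sketch of how such a ranking could be produced with scikit-learn's RFE wrapped around a RandomForestRegressor (synthetic data and hypothetical names, not the notebook's actual run):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# RFE repeatedly drops the weakest feature until n_features_to_select remain;
# the forest refitted on the survivors then reports impurity-based importances.
X_demo, y_demo = make_regression(n_samples=200, n_features=9, n_informative=3, random_state=42)
selector = RFE(RandomForestRegressor(n_estimators=50, random_state=42), n_features_to_select=5)
selector.fit(X_demo, y_demo)
print(selector.support_)                          # boolean mask of selected features
print(selector.estimator_.feature_importances_)   # importances of survivors, summing to 1
```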
Here we have created several new features, namely:¶
- BMI - Body Mass Index
- WaistToHipRatio - Waist-to-Hip Ratio
- BodySurfaceArea - Body Surface Area (Du Bois formula)
- AgeSquared - Age squared, to capture non-linear effects
- AbdomenToChestRatio - Abdomen-to-Chest Ratio
- UpperBodyFat, LowerBodyFat, ArmFatIndex - composite circumference indices
Replacing the raw circumference columns with these ratios and composites will help us reduce some of the problems caused by multicollinearity.
👉 | Removing Outliers
from sklearn.ensemble import IsolationForest
# Assuming X and y are your feature matrix and target vector
# Initialize the Isolation Forest model
iso_forest = IsolationForest(contamination=0.05, random_state=42) # Adjust contamination as needed
# Fit the model and predict outliers
outliers = iso_forest.fit_predict(X) == -1
# Filter out outliers from the dataset
X_clean = X[~outliers]
y_clean = y[~outliers]
# Output the number of rows before and after cleaning
original_rows = X.shape[0]
cleaned_rows = X_clean.shape[0]
rows_removed = original_rows - cleaned_rows
# Prepare data for styling
summary_data = {
'Metric': ['Original Number of Rows', 'Number of Rows After Removing Outliers', 'Number of Rows Removed'],
'Value': [original_rows, cleaned_rows, rows_removed]
}
# Convert to DataFrame
summary_df = pd.DataFrame(summary_data)
# Define a function to apply styling
def highlight_max(s):
is_max = s == s.max()
return ['background-color: #E6E6FA' if v else '' for v in is_max]
# Apply styling
styled_summary_df = summary_df.style.apply(highlight_max, axis=0).set_table_styles(
[{'selector': 'thead th',
'props': [('background-color', '#eac086'),
('color', 'black'),
('font-weight', 'bold')]},
{'selector': 'td',
'props': [('padding', '10px'),
('background-color', '#F5F5F5')]},
{'selector': 'table',
'props': [('border-collapse', 'collapse'),
('width', '50%')]},
{'selector': 'tr:nth-of-type(even)',
'props': [('background-color', '#FAFAFA')]},
{'selector': 'tr:nth-of-type(odd)',
'props': [('background-color', '#FFFFFF')]}]
)
# Display the styled DataFrame
styled_summary_df
| Metric | Value | |
|---|---|---|
| 0 | Original Number of Rows | 252 |
| 1 | Number of Rows After Removing Outliers | 239 |
| 2 | Number of Rows Removed | 13 |
👉 | Splitting into train and test set
import pandas as pd
from sklearn.model_selection import train_test_split
# Use the cleaned dataset after outlier removal
X_train, X_test, y_train, y_test = train_test_split(X_clean, y_clean, test_size=0.2, random_state=42)
# Prepare data for styling
summary_data = {
'Dataset': ['Training Features', 'Test Features', 'Training Labels', 'Test Labels'],
'Shape': [X_train.shape, X_test.shape, y_train.shape, y_test.shape]
}
# Convert to DataFrame
summary_df = pd.DataFrame(summary_data)
# Define a function to apply styling
def highlight_max(s):
is_max = s == s.max()
return ['background-color: #eac086' if v else '' for v in is_max]
# Apply styling
styled_summary_df = summary_df.style.apply(highlight_max, axis=0).set_table_styles(
[{'selector': 'thead th',
'props': [('background-color', '#eac086'),
('color', 'black'),
('font-weight', 'bold')]},
{'selector': 'td',
'props': [('padding', '10px'),
('background-color', '#000000'),
('color', 'white')]},
{'selector': 'table',
'props': [('border-collapse', 'collapse'),
('width', '50%')]},
{'selector': 'tr:nth-of-type(even)',
'props': [('background-color', '#f9f9f9')]},
{'selector': 'tr:nth-of-type(odd)',
'props': [('background-color', '#ffffff')]}]
)
# Display the styled DataFrame
styled_summary_df
| Dataset | Shape | |
|---|---|---|
| 0 | Training Features | (191, 10) |
| 1 | Test Features | (48, 10) |
| 2 | Training Labels | (191,) |
| 3 | Test Labels | (48,) |
X_train_df = pd.DataFrame(X_train)
X_test_df = pd.DataFrame(X_test)
y_train_df = pd.DataFrame(y_train)
y_test_df = pd.DataFrame(y_test)
X_train_df.to_csv(r'C:\Users\HP\Downloads\X_train.csv', index=False)
X_test_df.to_csv(r'C:\Users\HP\Downloads\X_test.csv', index=False)
y_train_df.to_csv(r'C:\Users\HP\Downloads\y_train.csv', index=False)
y_test_df.to_csv(r'C:\Users\HP\Downloads\y_test.csv', index=False)
👉 | Applying Feature Scaling
from sklearn.preprocessing import MinMaxScaler
# Create a MinMaxScaler object
scaler = MinMaxScaler()
# Fit the scaler to the training data and transform both the training and testing data
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
5 | Model Building¶
👉 | Metric Used: R2_score
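R² (the coefficient of determination) is the fraction of target variance the model explains: R² = 1 − SS_res/SS_tot. A small sketch showing that the hand computation matches `sklearn.metrics.r2_score` (the sample arrays are illustrative):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
manual = 1 - ss_res / ss_tot
print(manual, r2_score(y_true, y_pred))            # both ≈ 0.98
```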
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
# Define the chosen models
chosen_models = {
'XGBRegressor': XGBRegressor(),
'RandomForestRegressor': RandomForestRegressor(),
'GradientBoostingRegressor': GradientBoostingRegressor()
}
# Create a DataFrame to display model details
model_names = list(chosen_models.keys())
model_instances = [model.__class__.__name__ for model in chosen_models.values()]
model_data = {
'Model Name': model_names,
'Model Instance': model_instances
}
model_df = pd.DataFrame(model_data)
def style_model_df(df):
return df.style.set_table_styles(
[{'selector': 'thead th',
'props': [('background-color', '#000000'),
('color', '#eac086'),
('font-weight', 'bold')]},
{'selector': 'td',
'props': [('padding', '10px'),
('background-color', '#eac086'),
('color', 'black')]},
{'selector': 'table',
'props': [('border-collapse', 'collapse'),
('width', '60%')]},
{'selector': 'tr:nth-of-type(even)',
'props': [('background-color', '#f9f9f9')]},
{'selector': 'tr:nth-of-type(odd)',
'props': [('background-color', '#ffffff')]},
{'selector': 'th:first-child, td:first-child',
'props': [('border-right', '3px solid #000000')]},
{'selector': 'tr',
'props': [('border-bottom', '3px solid #000000')]} # Add a horizontal line after each row
]
).set_properties(**{'text-align': 'left'}).hide(axis='index')
# Apply styling to the DataFrame
styled_model_df = style_model_df(model_df)
# Display the styled DataFrame
styled_model_df
| Model Name | Model Instance |
|---|---|
| XGBRegressor | XGBRegressor |
| RandomForestRegressor | RandomForestRegressor |
| GradientBoostingRegressor | GradientBoostingRegressor |
# Parameter grid for XGBRegressor
param_grid_xgb = {
'n_estimators': [50, 100, 150, 200],
'learning_rate': [0.001, 0.01, 0.1, 0.2],
'max_depth': [3, 4, 5, 6],
'min_child_weight': [1, 3, 5],
'subsample': [0.6, 0.8, 1.0],
'colsample_bytree': [0.6, 0.8, 1.0]
}
# Parameter grid for RandomForestRegressor
param_grid_rf = {
'n_estimators': [50, 100, 150, 200],
'max_depth': [None, 10, 20, 30, 40],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'bootstrap': [True, False]
}
# Parameter grid for GradientBoostingRegressor
param_grid_gb = {
'n_estimators': [50, 100, 150, 200],
'learning_rate': [0.001, 0.01, 0.1, 0.2],
'max_depth': [3, 4, 5, 6],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'subsample': [0.8, 0.9, 1.0]
}
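Before launching the searches, it is worth sizing them: GridSearchCV fits every parameter combination once per fold. A quick sketch with `ParameterGrid` (duplicating the XGB grid above as `param_grid_xgb_demo`):

```python
from sklearn.model_selection import ParameterGrid

param_grid_xgb_demo = {
    'n_estimators': [50, 100, 150, 200],
    'learning_rate': [0.001, 0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5, 6],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
}
n_candidates = len(ParameterGrid(param_grid_xgb_demo))
print(n_candidates, n_candidates * 5)  # 1728 candidates -> 8640 fits with cv=5
```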
# Create model instances (avoid the name `xgb`, which would shadow the xgboost import)
xgb_reg = XGBRegressor()
rf = RandomForestRegressor()
gb = GradientBoostingRegressor()
# Perform GridSearchCV for each model
grid_xgb = GridSearchCV(estimator=xgb_reg, param_grid=param_grid_xgb, cv=5, n_jobs=-1, scoring='neg_mean_squared_error')
grid_rf = GridSearchCV(estimator=rf, param_grid=param_grid_rf, cv=5, n_jobs=-1, scoring='neg_mean_squared_error')
grid_gb = GridSearchCV(estimator=gb, param_grid=param_grid_gb, cv=5, n_jobs=-1, scoring='neg_mean_squared_error')
grid_xgb.fit(X_train, y_train)
GridSearchCV(cv=5,
             estimator=XGBRegressor(),
             n_jobs=-1,
             param_grid={'colsample_bytree': [0.6, 0.8, 1.0],
                         'learning_rate': [0.001, 0.01, 0.1, 0.2],
                         'max_depth': [3, 4, 5, 6],
                         'min_child_weight': [1, 3, 5],
                         'n_estimators': [50, 100, 150, 200],
                         'subsample': [0.6, 0.8, 1.0]},
             scoring='neg_mean_squared_error')
# Fit GridSearchCV
grid_rf.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=RandomForestRegressor(), n_jobs=-1,
param_grid={'bootstrap': [True, False],
'max_depth': [None, 10, 20, 30, 40],
'min_samples_leaf': [1, 2, 4],
'min_samples_split': [2, 5, 10],
'n_estimators': [50, 100, 150, 200]},
             scoring='neg_mean_squared_error')
# Fit GridSearchCV
grid_gb.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=GradientBoostingRegressor(), n_jobs=-1,
param_grid={'learning_rate': [0.001, 0.01, 0.1, 0.2],
'max_depth': [3, 4, 5, 6],
'min_samples_leaf': [1, 2, 4],
'min_samples_split': [2, 5, 10],
'n_estimators': [50, 100, 150, 200],
'subsample': [0.8, 0.9, 1.0]},
             scoring='neg_mean_squared_error')
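The XGB grid above spans 1,728 parameter combinations, i.e. roughly 8,600 model fits with cv=5. When an exhaustive search like that is too slow, scikit-learn's RandomizedSearchCV samples a fixed number of candidates from the same grid instead. A minimal sketch on synthetic data (the toy data, the smaller grid, and n_iter=5 are illustrative choices, not taken from this notebook):

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingRegressor

# Tiny synthetic regression problem so the sketch runs standalone
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=80)

param_dist = {'learning_rate': [0.01, 0.1, 0.2],
              'max_depth': [3, 4, 5],
              'n_estimators': [50, 100]}

# Try only n_iter=5 sampled candidates instead of all 3*3*2 = 18 combinations
search = RandomizedSearchCV(GradientBoostingRegressor(random_state=0),
                            param_dist, n_iter=5, cv=3, random_state=0,
                            scoring='neg_mean_squared_error', n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```

The same `best_params_` / `best_score_` attributes are available afterwards, so the rest of the workflow is unchanged.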
results = {
'Model': ['XGBRegressor', 'RandomForestRegressor', 'GradientBoostingRegressor'],
'Best Parameters': [grid_xgb.best_params_, grid_rf.best_params_, grid_gb.best_params_],
'Best Score': [grid_xgb.best_score_, grid_rf.best_score_, grid_gb.best_score_]
}
results_df = pd.DataFrame(results)
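Because the searches use scoring='neg_mean_squared_error', each `best_score_` above is a negated MSE (so closer to zero is better). Negating it back and taking the square root gives a cross-validated RMSE, which is easier to interpret. A small sketch using the scores from the run above:

```python
import numpy as np

# best_score_ values reported by the three grid searches (negated MSE)
best_scores = {'XGBRegressor': -0.011358,
               'RandomForestRegressor': -0.010256,
               'GradientBoostingRegressor': -0.010351}

for name, score in best_scores.items():
    rmse = np.sqrt(-score)  # negate back to MSE, then take the square root
    print(f'{name}: CV RMSE = {rmse:.4f}')
```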
# Define a function to style the DataFrame with more details
def style_results_df(df):
return df.style.set_table_styles(
[{'selector': 'thead th',
'props': [('background-color', '#eac086'),
('color', 'black'),
('font-weight', 'bold'),
('text-align', 'center')]},
{'selector': 'td',
'props': [('padding', '10px'),
('background-color', '#000000'),
('color', 'white'),
('text-align', 'center')]},
{'selector': 'table',
'props': [('border-collapse', 'collapse'),
('width', '80%'),
('margin', '20px auto')]},
{'selector': 'tr:nth-of-type(even)',
'props': [('background-color', '#f9f9f9')]},
{'selector': 'tr:nth-of-type(odd)',
'props': [('background-color', '#ffffff')]}]
).set_properties(**{'text-align': 'center'}).hide(axis='index')
# Apply styling to the results DataFrame
styled_results_df = style_results_df(results_df)
# Display the styled DataFrame
styled_results_df
| Model | Best Parameters | Best Score |
|---|---|---|
| XGBRegressor | {'colsample_bytree': 0.6, 'learning_rate': 0.2, 'max_depth': 3, 'min_child_weight': 1, 'n_estimators': 200, 'subsample': 1.0} | -0.011358 |
| RandomForestRegressor | {'bootstrap': True, 'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50} | -0.010256 |
| GradientBoostingRegressor | {'learning_rate': 0.1, 'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 5, 'n_estimators': 200, 'subsample': 0.8} | -0.010351 |
# Print the best parameters of each model
print("Best parameters for XGBRegressor:")
print(grid_xgb.best_params_)
print("\nBest parameters for RandomForestRegressor:")
print(grid_rf.best_params_)
print("\nBest parameters for GradientBoostingRegressor:")
print(grid_gb.best_params_)
# Define the models with the best hyperparameters
xgb_params = grid_xgb.best_params_
rf_params = grid_rf.best_params_
gb_params = grid_gb.best_params_
Best parameters for XGBRegressor:
{'colsample_bytree': 0.6, 'learning_rate': 0.2, 'max_depth': 3, 'min_child_weight': 1, 'n_estimators': 200, 'subsample': 1.0}
Best parameters for RandomForestRegressor:
{'bootstrap': True, 'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
Best parameters for GradientBoostingRegressor:
{'learning_rate': 0.1, 'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 5, 'n_estimators': 200, 'subsample': 0.8}
# Define the models with the best hyperparameters
xgb_model = XGBRegressor(**xgb_params)
rf_model = RandomForestRegressor(**rf_params)
gb_model = GradientBoostingRegressor(**gb_params)
# Fit the models
xgb_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)
gb_model.fit(X_train, y_train)
# Make predictions
y_pred_xgb = xgb_model.predict(X_test)
y_pred_rf = rf_model.predict(X_test)
y_pred_gb = gb_model.predict(X_test)
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, mean_absolute_percentage_error, median_absolute_error, explained_variance_score
import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
def compute(model, X_train, y_train, X_test, y_test, hashmap):
"""
Train the model, make predictions, and compute evaluation metrics.
Parameters:
- model: The machine learning model to be evaluated.
- X_train: Training features.
- y_train: Training labels.
- X_test: Testing features.
- y_test: Testing labels.
- hashmap: Dictionary to store the model name and metrics.
Returns:
- None
"""
# Train the model
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Compute metrics
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
medae = median_absolute_error(y_test, y_pred)
evs = explained_variance_score(y_test, y_pred)
# Update the hashmap with metrics
    model_name = type(model).__name__  # Model name without its parameter repr
hashmap[model_name] = {
'R^2': r2,
'RMSE': rmse,
'MAE': mae,
'MAPE': mape,
'Median AE': medae,
'Explained Variance Score': evs
}
# Example usage with models
hashmap = {}
# Evaluate the tuned models found by the grid search (using default instances
# here would discard the hyperparameter tuning above)
models = {
    'XGBRegressor': XGBRegressor(**xgb_params),
    'RandomForestRegressor': RandomForestRegressor(**rf_params),
    'GradientBoostingRegressor': GradientBoostingRegressor(**gb_params)
}
# Evaluate each model
for name, model in models.items():
compute(model, X_train, y_train, X_test, y_test, hashmap)
# Convert hashmap to DataFrame for better presentation
results_df = pd.DataFrame.from_dict(hashmap, orient='index').reset_index()
results_df.rename(columns={'index': 'Model'}, inplace=True)
# Style the results DataFrame
def style_results_df(df):
return df.style.set_table_styles(
[{'selector': 'thead th',
'props': [('background-color', '#eac086'),
('color', 'black'),
('font-weight', 'bold'),
('text-align', 'center'),
('font-size', '14px')]},
{'selector': 'td',
'props': [('padding', '10px'),
('background-color', '#000000'),
('color', '#eac086'),
('text-align', 'center'),
('font-size', '12px')]},
{'selector': 'table',
'props': [('border-collapse', 'collapse'),
('width', '80%'),
('margin', '20px auto'),
('border', '2px solid #000000')]},
{'selector': 'tr:nth-of-type(even)',
'props': [('background-color', '#f9f9f9')]},
{'selector': 'tr:nth-of-type(odd)',
'props': [('background-color', '#ffffff')]}]
).set_properties(**{'text-align': 'center'}).hide(axis='index')
# Apply styling to the results DataFrame
styled_results_df = style_results_df(results_df)
# Display the styled DataFrame
styled_results_df
| Model | R^2 | RMSE | MAE | MAPE | Median AE | Explained Variance Score |
|---|---|---|---|---|---|---|
| XGBRegressor | 0.989803 | 0.047240 | 0.022924 | 0.008950 | 0.009602 | 0.989864 |
| RandomForestRegressor | 0.980255 | 0.065738 | 0.026357 | 0.010270 | 0.007266 | 0.980640 |
| GradientBoostingRegressor | 0.987876 | 0.051511 | 0.022400 | 0.008947 | 0.004594 | 0.987943 |
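The table reports both R² and the Explained Variance Score, which differ only in how they treat a systematic bias in the predictions: R² penalises a constant offset, while the Explained Variance Score does not. A short sketch of that distinction (the toy values and the 0.5 offset are arbitrary illustrations):

```python
import numpy as np
from sklearn.metrics import r2_score, explained_variance_score

y_true = np.array([2.0, 2.5, 3.0, 3.5, 4.0])
y_pred = y_true + 0.5  # perfect shape, but a constant bias of 0.5

print(r2_score(y_true, y_pred))                  # penalised by the offset
print(explained_variance_score(y_true, y_pred))  # 1.0: variance fully explained
```

When the two scores are nearly equal, as they are for all three models above, the predictions carry essentially no systematic bias.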
import pandas as pd
# Create a DataFrame for comparison
comparison_df = pd.DataFrame({
'Actual Body_Fat': y_test,
'XGB Predicted': y_pred_xgb,
'RandomForest Predicted': y_pred_rf,
'GradientBoosting Predicted': y_pred_gb
})
# Style the DataFrame
def style_comparison_df(df):
return df.style.set_table_styles(
[{'selector': 'thead th',
'props': [('background-color', '#000000'),
('color', 'white'),
('font-weight', 'bold'),
('text-align', 'center'),
('font-size', '18px')]}, # Increased font size to 18px
{'selector': 'td',
'props': [('padding', '10px'),
('background-color', '#eac086 '),
('color', 'black'),
('text-align', 'center'),
('font-size', '14px')]}, # Increased font size to 14px
{'selector': 'table',
'props': [('border-collapse', 'collapse'),
('width', '80%'),
('margin', '20px auto'),
('border', '2px solid #000000')]},
{'selector': 'tr:nth-of-type(even)',
'props': [('background-color', '#f9f9f9')]},
{'selector': 'tr:nth-of-type(odd)',
'props': [('background-color', '#ffffff')]},
{'selector': 'th, td',
'props': [('border-right', '1px solid #000000')]}, # Added vertical lines
{'selector': 'th:first-child, td:first-child',
'props': [('border-left', '1px solid #000000')]}, # Added vertical lines
{'selector': 'th:last-child, td:last-child',
'props': [('border-right', '1px solid #000000')]}, # Added vertical lines
]).set_properties(**{'text-align': 'center'}).hide(axis='index')
# Apply styling to the results DataFrame
styled_comparison_df = style_comparison_df(comparison_df)
# Display the styled DataFrame
styled_comparison_df
| Actual Body_Fat | XGB Predicted | RandomForest Predicted | GradientBoosting Predicted |
|---|---|---|---|
| 1.740466 | 1.923942 | 1.801761 | 1.773055 |
| 2.667228 | 2.683122 | 2.646584 | 2.632619 |
| 3.186353 | 3.209780 | 3.184591 | 3.183087 |
| 2.928524 | 2.930230 | 2.920177 | 2.925560 |
| 3.077312 | 3.063828 | 3.067740 | 3.072026 |
| 2.602690 | 2.617278 | 2.603371 | 2.619777 |
| 3.374169 | 3.340529 | 3.377351 | 3.363605 |
| 2.351375 | 2.358820 | 2.367517 | 2.376295 |
| 2.208274 | 2.186350 | 2.187794 | 2.162981 |
| 3.303217 | 3.326757 | 3.305812 | 3.305484 |
| 2.151762 | 2.188141 | 2.107588 | 2.105292 |
| 2.251292 | 2.298088 | 2.256393 | 2.362219 |
| 3.397858 | 3.383094 | 3.403739 | 3.395556 |
| 2.714695 | 2.639407 | 2.710762 | 2.697947 |
| 3.000720 | 2.982914 | 2.999048 | 3.003940 |
| 3.610918 | 3.610105 | 3.600507 | 3.615493 |
| 3.314186 | 3.288378 | 3.308686 | 3.306841 |
| 3.433987 | 3.423511 | 3.430143 | 3.429656 |
| 2.856470 | 2.871415 | 2.843550 | 2.841665 |
| 3.206803 | 3.216308 | 3.202033 | 3.200899 |
| 3.335770 | 3.293461 | 3.329212 | 3.330851 |
| 3.049273 | 3.052361 | 3.051418 | 3.055137 |
| 3.471966 | 3.486433 | 3.470359 | 3.471233 |
| 2.624669 | 2.545546 | 2.529972 | 2.475943 |
| 2.360854 | 2.405570 | 2.380210 | 2.303955 |
| 2.533697 | 2.625536 | 2.553749 | 2.558353 |
| 2.282382 | 2.421396 | 2.418296 | 2.476196 |
| 2.980619 | 3.003736 | 2.982452 | 2.984478 |
| 3.443618 | 3.462661 | 3.460553 | 3.486458 |
| 2.433613 | 2.356863 | 2.429167 | 2.291836 |
| 3.095578 | 3.100104 | 3.099773 | 3.095440 |
| 2.674149 | 2.597281 | 2.577352 | 2.466882 |
| 2.292535 | 2.288321 | 2.245968 | 2.240487 |
| 2.884801 | 2.924354 | 2.896750 | 2.889850 |
| 2.917771 | 2.953461 | 2.921842 | 2.918005 |
| 3.020425 | 2.666239 | 2.709060 | 2.726274 |
| 1.974081 | 1.984332 | 2.035196 | 2.065408 |
| 3.459466 | 3.421786 | 3.461100 | 3.457072 |
| 3.353407 | 3.358920 | 3.357636 | 3.351748 |
| 3.549617 | 3.515830 | 3.543443 | 3.511788 |
| 2.667228 | 2.660399 | 2.653350 | 2.613112 |
| 3.325036 | 3.315657 | 3.326879 | 3.322874 |
| 2.484907 | 2.601023 | 2.499396 | 2.506492 |
| 3.214868 | 3.208258 | 3.214617 | 3.215482 |
| 2.839078 | 2.765460 | 2.784825 | 2.823285 |
| 2.351375 | 2.357627 | 2.330183 | 2.408689 |
| 2.939162 | 2.923510 | 2.941383 | 2.935352 |
| 3.526361 | 3.483199 | 3.524351 | 3.512623 |
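The "Actual Body_Fat" values in this table sit around 1.7-3.6 rather than typical body-fat percentages, which suggests the target was natural-log-transformed earlier in the notebook. Under that assumption (not confirmed in this section), predictions map back to the percentage scale with np.exp:

```python
import numpy as np

# Assumption: the target was modelled as log(BodyFat). If so, exponentiating
# recovers percentages, e.g. exp(2.667228) is roughly 14.4% body fat.
y_pred_log = np.array([1.740466, 2.667228, 3.186353])  # values from the table above
y_pred_pct = np.exp(y_pred_log)
print(np.round(y_pred_pct, 1))
```

If a different transform (or none) was applied, the inverse mapping should be adjusted accordingly.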
import matplotlib.pyplot as plt
def plot_actual_vs_predicted(comparison_df):
"""
Plots actual vs predicted values for all models with detailed styling.
Uses black background and '#eac086' color for plotting.
Parameters:
- comparison_df: DataFrame with actual vs predicted values.
"""
# Limit to the first 10 values for plotting
comparison_df = comparison_df.head(10)
plt.figure(figsize=(18, 6))
plt.style.use('dark_background') # Use a dark background for the plots
# Colors for each model
colors = ['#eac086', '#eac086', '#eac086'] # Same color for all models as requested
# Iterate through each model prediction to create subplots
for i, model in enumerate(['XGB Predicted', 'RandomForest Predicted', 'GradientBoosting Predicted']):
plt.subplot(1, 3, i + 1)
plt.scatter(comparison_df['Actual Body_Fat'], comparison_df[model],
color=colors[i], alpha=0.7, edgecolor='black', label=f'{model} Predictions', s=100) # Increase scatter size
plt.plot([comparison_df['Actual Body_Fat'].min(), comparison_df['Actual Body_Fat'].max()],
[comparison_df['Actual Body_Fat'].min(), comparison_df['Actual Body_Fat'].max()],
'--', lw=3, color='white', label='Perfect Fit') # Increase line width and set color
plt.xlabel('Actual Body Fat Percentage', fontsize=12)
plt.ylabel('Predicted Body Fat Percentage', fontsize=12)
plt.title(f'{model}: Actual vs Predicted', fontsize=14, color='white')
plt.legend()
# Add value labels for each point
for j in range(len(comparison_df)):
plt.text(comparison_df['Actual Body_Fat'].iloc[j], comparison_df[model].iloc[j],
f'{comparison_df[model].iloc[j]:.2f}', fontsize=10, color='white')
plt.tight_layout()
plt.show()
# Plot the comparison with the updated plotting function
plot_actual_vs_predicted(comparison_df)
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Data for plotting: reuse the metrics DataFrame built in the evaluation loop
# above (re-typing the numbers by hand risks drifting out of sync with the
# computed results)
# Set the style
plt.style.use('dark_background')
sns.set_palette(sns.color_palette(['#eac086', '#ffcd94', '#ffad60']))
# Define metrics
metrics = ['R^2', 'RMSE', 'MAE', 'MAPE', 'Median AE', 'Explained Variance Score']
# Create and save plots for each metric
for metric in metrics:
plt.figure(figsize=(12, 8))
ax = sns.barplot(x='Model', y=metric, data=results_df)
# Add data labels with white text
for container in ax.containers:
ax.bar_label(container, fmt='%.4f', fontsize=12, color='white')
# Set labels, title, and subtitle
plt.title(f'{metric} Comparison', fontsize=18, color='white')
plt.suptitle(f'Comparison of Models based on {metric}', fontsize=14, color='white', y=0.94)
plt.xlabel('Model', fontsize=14, color='white')
plt.ylabel(metric, fontsize=14, color='white')
# Set background color
plt.gca().set_facecolor('#000000')
# Save the figure
plt.savefig(f'{metric}_comparison.png', bbox_inches='tight')
# Show the plot
plt.show()
🏆 Best ML Model: XGBRegressor 🏆¶
After a detailed comparison of the three models—XGBRegressor, RandomForestRegressor, and GradientBoostingRegressor—the XGBRegressor is declared the winner based on the following reasoning:
1. Highest R² Score¶
- XGBRegressor has the highest R² Score of 0.9898, indicating that it explains approximately 98.98% of the variance in the data. This is marginally higher than the GradientBoostingRegressor (0.9879) and significantly better than the RandomForestRegressor (0.9803), demonstrating superior predictive power.
2. Lowest RMSE (Root Mean Squared Error)¶
- The RMSE for XGBRegressor is 0.0472, which is the lowest among the three models. This reflects that its predictions are closer to the actual values, indicating a high degree of accuracy. In contrast, GradientBoostingRegressor has a slightly higher RMSE of 0.0515, and RandomForestRegressor has a much higher RMSE of 0.0657.
3. Competitive MAE (Mean Absolute Error) and MAPE (Mean Absolute Percentage Error)¶
- The XGBRegressor achieves a MAE of 0.0229 and a MAPE of 0.00895, both very competitive. The GradientBoostingRegressor edges slightly ahead with a lower MAE of 0.0224 and an essentially identical MAPE, but the difference is minimal. The RandomForestRegressor has higher errors (MAE of 0.0264 and MAPE of 0.0103), indicating less accurate predictions.
4. Median Absolute Error (Median AE): the One Trade-off¶
- Median AE is the one metric where the XGBRegressor trails: its 0.0096 is higher than both the GradientBoostingRegressor (0.0046) and the RandomForestRegressor (0.0073), meaning its typical per-sample error is slightly larger. The absolute differences are small, however, and do not offset its advantages on the other metrics.
5. Highest Explained Variance Score¶
- The Explained Variance Score for XGBRegressor is 0.9899, the highest among all three models. This score further validates that the XGBRegressor captures the underlying variance in the data most effectively, leading to more reliable predictions.
Conclusion:¶
While all three models perform well, the XGBRegressor stands out due to its combination of the highest R² score, lowest RMSE, competitive MAE and MAPE, and the highest Explained Variance Score (0.9899). Together these factors make it the most robust and reliable choice for predicting body fat from the circumference measurements in this dataset.
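Once a winner is chosen, it is worth persisting the fitted model so the grid search does not have to be rerun. A minimal, self-contained sketch using joblib (shown with a small GradientBoostingRegressor on synthetic data so it runs standalone; in the notebook you would dump the fitted xgb_model instead, and the file name is an arbitrary choice):

```python
import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Tiny synthetic data so the sketch runs standalone; in the notebook,
# joblib.dump(xgb_model, ...) would persist the tuned winner instead.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

model = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X, y)
joblib.dump(model, 'best_bodyfat_model.joblib')  # hypothetical file name

loaded = joblib.load('best_bodyfat_model.joblib')
print(np.allclose(loaded.predict(X), model.predict(X)))
```

The reloaded model reproduces the original predictions exactly, so it can be dropped straight into an inference script.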